今日やること!
使えるgeom関数を増やす
段階的に図をカスタマイズするために、scale_関数、guides()関数、theme()関数を使えるようにする
General social Survey (GSS: 総合的社会調査) からえられた信仰と居住地域の分布のデータをつかう。
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(socviz)
knitr::kable(head(gss_sm)) # こうすると表がいい感じに描画されるよ
| year | id | ballot | age | childs | sibs | degree | race | sex | region | income16 | relig | marital | padeg | madeg | partyid | polviews | happy | partners | grass | zodiac | pres12 | wtssall | income_rc | agegrp | ageq | siblings | kids | religion | bigregion | partners_rc | obama |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016 | 1 | 1 | 47 | 3 | 2 | Bachelor | White | Male | New England | $170000 or over | None | Married | Graduate | High School | Independent | Moderate | Pretty Happy | NA | NA | Aquarius | 3 | 0.9569935 | Gt $170000 | Age 45-55 | Age 34-49 | 2 | 3 | None | Northeast | NA | 0 |
| 2016 | 2 | 2 | 61 | 0 | 3 | High School | White | Male | New England | $50000 to 59999 | None | Never Married | Lt High School | High School | Ind,near Dem | Liberal | Pretty Happy | 1 Partner | Legal | Scorpio | 1 | 0.4784968 | Gt $50000 | Age 55-65 | Age 49-62 | 3 | 0 | None | Northeast | 1 | 1 |
| 2016 | 3 | 3 | 72 | 2 | 3 | Bachelor | White | Male | New England | $75000 to $89999 | Catholic | Married | High School | Lt High School | Not Str Republican | Conservative | Very Happy | 1 Partner | Not Legal | Pisces | 2 | 0.9569935 | Gt $75000 | Age 65+ | Age 62+ | 3 | 2 | Catholic | Northeast | 1 | 0 |
| 2016 | 4 | 1 | 43 | 4 | 3 | High School | White | Female | New England | $170000 or over | Catholic | Married | NA | High School | Not Str Republican | Moderate | Pretty Happy | NA | NA | Cancer | 2 | 1.9139870 | Gt $170000 | Age 35-45 | Age 34-49 | 3 | 4+ | Catholic | Northeast | NA | 0 |
| 2016 | 5 | 3 | 55 | 2 | 2 | Graduate | White | Female | New England | $170000 or over | None | Married | Bachelor | High School | Not Str Democrat | Slightly Liberal | Very Happy | 1 Partner | Legal | Scorpio | 1 | 1.4354903 | Gt $170000 | Age 45-55 | Age 49-62 | 2 | 2 | None | Northeast | 1 | 1 |
| 2016 | 6 | 2 | 53 | 2 | 2 | Junior College | White | Female | New England | $60000 to 74999 | None | Married | NA | High School | Not Str Democrat | Slightly Liberal | Very Happy | 1 Partner | Legal | Scorpio | 1 | 0.9569935 | Gt $60000 | Age 45-55 | Age 49-62 | 2 | 2 | None | Northeast | 1 | 1 |
作業のフローは以下の通り 1. 居住地域と信仰に関する2500人のgss_smデータセットをロードする 1. 地域別の各信仰の信徒数を集計 1. 地域における信徒数のパーセンテージを算出
%>% は、あるデータフレームを別のデータフレームに集計・変換するためのパイプライン操作に使う演算子。データはパイプの片側から入り、dplyrの関数で加工されたあと、もう一方の側から加工済みのデータとして出力される。できることは、
「gss_smデータセットから、region別・religion別の信徒数を求めたい」
rel_by_region <-
# gss_smの2500人のデータを、
gss_sm %>%
# regionとreligごとにデータをグループに分けて、
group_by(bigregion, religion) %>%
# グループ別の人数を集計して、
summarize(N = n()) %>%
# 全体の人数に占める割合とパーセンテージを計算する
mutate(freq = N / sum(N),
pct = round((freq * 100), 0))
## `summarise()` has grouped output by 'bigregion'. You can override using the
## `.groups` argument.
rel_by_region
## # A tibble: 24 × 5
## # Groups: bigregion [4]
## bigregion religion N freq pct
## <fct> <fct> <int> <dbl> <dbl>
## 1 Northeast Protestant 158 0.324 32
## 2 Northeast Catholic 162 0.332 33
## 3 Northeast Jewish 27 0.0553 6
## 4 Northeast None 112 0.230 23
## 5 Northeast Other 28 0.0574 6
## 6 Northeast <NA> 1 0.00205 0
## 7 Midwest Protestant 325 0.468 47
## 8 Midwest Catholic 172 0.247 25
## 9 Midwest Jewish 3 0.00432 0
## 10 Midwest None 157 0.226 23
## # … with 14 more rows
「パーセンテージをbigregion別に集計して、100%になることを確認したい」
rel_by_region %>%
# regionごとにデータをグループに分けて、
group_by(bigregion) %>%
# pctの合計が100になることを確認
summarize(total = sum(pct))
## # A tibble: 4 × 2
## bigregion total
## <fct> <dbl>
## 1 Northeast 100
## 2 Midwest 101
## 3 South 100
## 4 West 101
ではこれをプロットしてみよう。
p <- ggplot(rel_by_region,
aes(x = bigregion,
y = pct,
fill = religion)) +
geom_col(position = 'dodge2') + # dodgeはぴったりくっつく、dodge2はバーの間にスペースが開く
labs(x = 'Region', y = 'Percent', fill = 'Religion') +
theme(legend.position = 'top')
p
バーがたくさん並んでいて混沌とした図を作り替えよう。手始めに、facet関数でbigregionごとにパネルを分けて、宗派をy軸にとり、横棒グラフにしてみるか。
Tips: cood_flip()で、x-y軸を入れ替えることができる
p <- ggplot(rel_by_region,
aes(x = religion,
y = pct,
fill = religion)) +
geom_col(position = 'dodge2') +
labs(x = 'Region', y = 'Percent', fill = 'Religion') +
guides(fill = 'none') +
coord_flip() +
facet_grid(~ bigregion) +
theme(legend.position = 'top')
p
「17カ国のOECD諸国における、移植のための臓器提供意思に関する時系列データをプロットせよ。」
まずは、データセットの中身をランダムに10行分見てみる。
organdata %>%
select(1:6) %>%
slice_sample(n = 10)
## # A tibble: 10 × 6
## country year donors pop pop_dens gdp
## <chr> <date> <dbl> <int> <dbl> <int>
## 1 Italy 2002-01-01 18.1 57994 19.2 25569
## 2 Switzerland 1997-01-01 14.3 7089 17.2 27675
## 3 Denmark 1998-01-01 11 5304 12.3 25537
## 4 Germany NA NA NA NA NA
## 5 Ireland 2001-01-01 18.2 3863 5.50 29703
## 6 Sweden 2000-01-01 10.9 8872 1.97 26574
## 7 Finland 1991-01-01 16.8 5014 1.48 17281
## 8 Netherlands 2001-01-01 11.6 16046 38.6 28756
## 9 Switzerland NA NA 6712 16.3 24648
## 10 Italy 1995-01-01 10.1 57301 19.0 20652
Trial 1. まずは年に対するドナーの数を散布図で示してみようか。
Check!: 警告の中身を読むべし。
p <- ggplot(data = organdata,
mapping = aes(x = year, y = donors)) +
geom_point()
p
## Warning: Removed 34 rows containing missing values (geom_point).
Trial 2. 次は国別に時系列の変化を可視化してみようか。
p <- ggplot(data = organdata,
mapping = aes(x = year, y = donors)) +
geom_line(aes(group = country)) +
facet_wrap(~ country)
p
## Warning: Removed 34 row(s) containing missing values (geom_path).
Trial 3. 国ごとの臓器提供者数のばらつきを可視化してみようか。
p <- ggplot(data = organdata,
mapping = aes(x = country, y = donors)) +
geom_boxplot()
p
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).
うわあ、x軸がつぶれている・・・のでx-y軸をflipしてみようか
p <- ggplot(data = organdata,
mapping = aes(x = country, y = donors)) +
geom_boxplot() +
coord_flip()
p
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).
データの表示順には拘りたい。今はデフォルトでアルファベット順なので、臓器提供者数が多い順に並べてみるのは良いアイディアかも。
p <- ggplot(data = organdata,
mapping = aes(x = reorder(country, donors, na.rm = TRUE),
y = donors)) +
geom_boxplot() +
labs(x = NULL) +
coord_flip()
p
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).
色をつけることも可能。
p <- ggplot(data = organdata,
mapping = aes(x = reorder(country, donors, na.rm = TRUE),
y = donors,
fill = world)) +
geom_boxplot() +
labs(x = NULL) +
coord_flip() +
theme(legend.position = 'top')
p
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).
violin plotにもすぐ変更できる。
p <- ggplot(data = organdata,
mapping = aes(x = reorder(country, donors, na.rm = TRUE),
y = donors,
fill = world)) +
geom_violin() +
labs(x = NULL) +
coord_flip() +
theme(legend.position = 'top')
p
## Warning: Removed 34 rows containing non-finite values (stat_ydensity).
ここまで試してきたように、カテゴリカル変数をy軸、x軸で連続量の分布を確認することは強力な可視化手法の一つ。
データ数が多くないならば、観測値を直接表示するのも良い。しかし・・・
p <- ggplot(data = organdata,
mapping = aes(x = reorder(country, donors, na.rm = TRUE),
y = donors,
color = world)) +
geom_point() +
labs(x = NULL) +
coord_flip() +
theme(legend.position = 'top')
p
## Warning: Removed 34 rows containing missing values (geom_point).
観測点が重なって見えなくなってしまうので、geom_jitter() 関数で、観測点に揺らぎを持たせる。
p <- ggplot(data = organdata,
mapping = aes(x = reorder(country, donors, na.rm = TRUE),
y = donors,
color = world)) +
geom_jitter() +
labs(x = NULL) +
coord_flip() +
theme(legend.position = 'top')
p
## Warning: Removed 34 rows containing missing values (geom_point).
もうすこし揺らぎを小さくすると見やすい。
p <- ggplot(data = organdata,
mapping = aes(x = reorder(country, donors, na.rm = TRUE),
y = donors,
color = world)) +
geom_jitter(width = 0.15) +
labs(x = NULL) +
coord_flip() +
theme(legend.position = 'top')
p
## Warning: Removed 34 rows containing missing values (geom_point).
さらに、Cleveland dotplotはシンプルで効果的な可視化手法なので試してみよう。
まずは、国別に平均・標準偏差を求めよう。
by_country <- organdata %>%
group_by(consent_law, country) %>%
summarize(donors_mean = mean(donors, na.rm = TRUE),
donors_sd = sd(donors, na.rm = TRUE),
gdp_mean = mean(gdp, na.rm = TRUE),
health_mean = mean(health, na.rm = TRUE),
roads_mean = mean(roads, na.rm = TRUE),
cerebval_mean = mean(cerebvas, na.rm = TRUE))
## `summarise()` has grouped output by 'consent_law'. You can override using the
## `.groups` argument.
by_country
## # A tibble: 17 × 8
## # Groups: consent_law [2]
## consent_law country donors_mean donors_sd gdp_mean health_mean roads_mean
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Informed Australia 10.6 1.14 22179. 1958. 105.
## 2 Informed Canada 14.0 0.751 23711. 2272. 109.
## 3 Informed Denmark 13.1 1.47 23722. 2054. 102.
## 4 Informed Germany 13.0 0.611 22163. 2349. 113.
## 5 Informed Ireland 19.8 2.48 20824. 1480. 118.
## 6 Informed Netherlands 13.7 1.55 23013. 1993. 76.1
## 7 Informed United Kin… 13.5 0.775 21359. 1561. 67.9
## 8 Informed United Sta… 20.0 1.33 29212. 3988. 155.
## 9 Presumed Austria 23.5 2.42 23876. 1875. 150.
## 10 Presumed Belgium 21.9 1.94 22500. 1958. 155.
## 11 Presumed Finland 18.4 1.53 21019. 1615. 93.6
## 12 Presumed France 16.8 1.60 22603. 2160. 156.
## 13 Presumed Italy 11.1 4.28 21554. 1757 122.
## 14 Presumed Norway 15.4 1.11 26448. 2217. 70.0
## 15 Presumed Spain 28.1 4.96 16933 1289. 161.
## 16 Presumed Sweden 13.1 1.75 22415. 1951. 72.3
## 17 Presumed Switzerland 14.2 1.71 27233 2776. 96.4
## # … with 1 more variable: cerebval_mean <dbl>
発展的な内容ですが、すっきりと書くには、こうすると良い。
by_country <- organdata %>%
group_by(consent_law, country) %>%
summarize(across(is.numeric, list(mean = mean, sd = sd), na.rm = TRUE))
## Warning: Predicate functions must be wrapped in `where()`.
##
## # Bad
## data %>% select(is.numeric)
##
## # Good
## data %>% select(where(is.numeric))
##
## ℹ Please update your code.
## This message is displayed once per session.
## `summarise()` has grouped output by 'consent_law'. You can override using the
## `.groups` argument.
by_country
## # A tibble: 17 × 28
## # Groups: consent_law [2]
## consent_law country donors_mean donors_sd pop_mean pop_sd pop_dens_mean
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Informed Australia 10.6 1.14 18318. 8.31e2 0.237
## 2 Informed Canada 14.0 0.751 29608. 1.19e3 0.297
## 3 Informed Denmark 13.1 1.47 5257. 8.06e1 12.2
## 4 Informed Germany 13.0 0.611 80255. 5.16e3 22.5
## 5 Informed Ireland 19.8 2.48 3674. 1.32e2 5.23
## 6 Informed Netherlands 13.7 1.55 15548. 3.73e2 37.4
## 7 Informed United Kingd… 13.5 0.775 58187. 6.26e2 24.0
## 8 Informed United States 20.0 1.33 269330. 1.25e4 2.80
## 9 Presumed Austria 23.5 2.42 7927. 1.09e2 9.45
## 10 Presumed Belgium 21.9 1.94 10153. 1.09e2 30.7
## 11 Presumed Finland 18.4 1.53 5112. 6.86e1 1.51
## 12 Presumed France 16.8 1.60 58056. 8.51e2 10.5
## 13 Presumed Italy 11.1 4.28 57360. 4.25e2 19.0
## 14 Presumed Norway 15.4 1.11 4386. 9.73e1 1.35
## 15 Presumed Spain 28.1 4.96 39666. 9.51e2 7.84
## 16 Presumed Sweden 13.1 1.75 8789. 1.14e2 1.95
## 17 Presumed Switzerland 14.2 1.71 7037. 1.70e2 17.0
## # … with 21 more variables: pop_dens_sd <dbl>, gdp_mean <dbl>, gdp_sd <dbl>,
## # gdp_lag_mean <dbl>, gdp_lag_sd <dbl>, health_mean <dbl>, health_sd <dbl>,
## # health_lag_mean <dbl>, health_lag_sd <dbl>, pubhealth_mean <dbl>,
## # pubhealth_sd <dbl>, roads_mean <dbl>, roads_sd <dbl>, cerebvas_mean <dbl>,
## # cerebvas_sd <dbl>, assault_mean <dbl>, assault_sd <dbl>,
## # external_mean <dbl>, external_sd <dbl>, txp_pop_mean <dbl>,
## # txp_pop_sd <dbl>
まずは平均値だけを表示してみる。
p <- ggplot(data = by_country,
mapping = aes(x = donors_mean,
y = reorder(country, donors_mean, na.rm = TRUE),
color = consent_law)) +
geom_point(size = 3) +
labs(x = 'Donor Procurent Rate',
y = '', color = 'Consent Law') +
theme(legend.position = 'top')
p
facetすることもできるが・・・
p <- ggplot(data = by_country,
mapping = aes(x = donors_mean,
y = reorder(country, donors_mean, na.rm = TRUE),
color = consent_law)) +
geom_point(size = 3) +
labs(x = 'Donor Procurent Rate',
y = '', color = 'Consent Law') +
theme(legend.position = 'top') +
facet_wrap(~ consent_law, ncol = 1)
p
全てのx軸のカテゴリが表示されてしまうので、パネルごとに、x/y軸で表示位する範囲を帰るために、scales = ’free’引数を指定する。
p <- ggplot(data = by_country,
mapping = aes(x = donors_mean,
y = reorder(country, donors_mean, na.rm = TRUE),
color = consent_law)) +
geom_point(size = 3) +
labs(x = 'Donor Procurent Rate',
y = '', color = 'Consent Law') +
theme(legend.position = 'top') +
facet_wrap(~ consent_law, scales = 'free_y', ncol = 1)
p
クリーブランドドットプロットは、モデルの要約や、誤差を含んだ結果を示す際に、一般的にたて棒グラフ・横棒グラフよりこのまれる。誤差範囲を示したければ、geom_poitrangeが便利。
p <- ggplot(data = by_country,
mapping = aes(x = reorder(country, donors_mean, na.rm = TRUE),
y = donors_mean)) +
geom_pointrange(aes(ymin = donors_mean - donors_sd,
ymax = donors_mean + donors_sd)) +
labs(x = '',
y = 'Donor Procurent Rate', color = 'Consent Law') +
theme(legend.position = 'top') +
coord_flip()
p
geom_text関数を使うと、こんなこともできる
散布図でプロットした点にラベルをつける
図の中に有益な説明を表記する
p <- ggplot(data = by_country,
mapping = aes(x = roads_mean, y = donors_mean)) +
geom_point() +
geom_text(mapping = aes(label = country))
p
geom_text関数のhjust引数を調節し、テキストの位置を調整するとよい。
p <- ggplot(data = by_country,
mapping = aes(x = roads_mean, y = donors_mean)) +
geom_point() +
geom_text(mapping = aes(label = country),
hjust = -0.1)
p
とはいえ調整が難しいし、重なってしまうのは回避できていない。代わりにggrepelパッケージを使うと良いぞ。
library(ggrepel)
p <- ggplot(data = by_country,
mapping = aes(x = roads_mean, y = donors_mean)) +
geom_point() +
geom_text_repel(mapping = aes(label = country))
p
データ内の関心のある点のみにラベルをつけることを考えよう。
GDP > 25000の国だけラベルをつける。
p <- ggplot(data = by_country,
mapping = aes(x = gdp_mean, y = health_mean)) +
geom_point() +
geom_text_repel(data = subset(by_country, gdp_mean > 25000),
mapping = aes(label = country)) +
geom_vline(xintercept = 25000, color = 'grey')
p
GDP > 25000またはhealth < 1500またはcountryが”Belgium”の国だけラベルをつける。
p <- ggplot(data = by_country,
mapping = aes(x = gdp_mean, y = health_mean)) +
geom_point() +
geom_text_repel(data = subset(by_country,
gdp_mean > 25000 | health_mean < 1500 |
country %in% c('Belgium')),
mapping = aes(label = country)) +
geom_vline(xintercept = 25000, color = 'grey') +
geom_hline(yintercept = 1500, color = 'grey')
p
p <- ggplot(data = organdata,
mapping = aes(x = roads, y = donors)) +
geom_point() +
annotate(geom = 'text',
x = 157, y = 33,
label = 'A surprisingly high \n recovery rate.',
hjust = 0)
p
## Warning: Removed 34 rows containing missing values (geom_point).
p <- ggplot(data = organdata,
mapping = aes(x = roads, y = donors)) +
geom_point() +
annotate(geom = 'text',
x = 157, y = 33,
label = 'A surprisingly high \n recovery rate.',
hjust = 0) +
annotate(geom = 'rect',
xmin = 125, xmax = 155,
ymin = 30, ymax = 35,
fill = 'red',
alpha = 0.2)
p
## Warning: Removed 34 rows containing missing values (geom_point).
p <- ggplot(data = organdata,
mapping = aes(x = roads,
y = donors,
color = world)) +
geom_point()
p
## Warning: Removed 34 rows containing missing values (geom_point).
scale_<mapping>_<kind>()というのが、scale関数の命名規則
よく出てくるのは
scale_color_<kind>() / scale_color_<kind>(): 色や塗りつぶしの調整
scale_x_<kind>() / scale_y_<kind>(): x軸やy軸の調整
p <- ggplot(data = organdata,
mapping = aes(x = roads,
y = donors,
color = world)) +
geom_point() +
scale_x_log10() +
scale_y_continuous(breaks = c(5, 15, 25),
labels = c('Five', 'Fifteen', 'Twenty Five'))
p
## Warning: Removed 34 rows containing missing values (geom_point).
colorのラベルを変えたい
p <- ggplot(data = organdata,
mapping = aes(x = roads,
y = donors,
color = world)) +
geom_point() +
scale_x_log10() +
scale_y_continuous(breaks = c(5, 15, 25),
labels = c('Five', 'Fifteen', 'Twenty Five')) +
scale_color_discrete(labels = c('Corporatist', 'Liberal',
'Social Democratic', 'Unclassified')) +
labs(x = 'Road Deaths', y = 'Donor Procurement', color = 'Welfare State')
p
## Warning: Removed 34 rows containing missing values (geom_point).
ときには凡例を削除したくなることもある。
p <- ggplot(data = organdata,
mapping = aes(x = roads,
y = donors,
color = world)) +
geom_point() +
scale_x_log10() +
scale_y_continuous(breaks = c(5, 15, 25),
labels = c('Five', 'Fifteen', 'Twenty Five')) +
scale_color_discrete(labels = c('Corporatist', 'Liberal',
'Social Democratic', 'Unclassified')) +
labs(x = 'Road Deaths', y = 'Donor Procurement', color = 'Welfare State') +
guides(color = 'none')
p
## Warning: Removed 34 rows containing missing values (geom_point).